Week 8 - Support Vector Machines¶

Dr. David Elliott¶

1.6. Support Vector Classifier (SVC)

1.7. Support Vector Machine (SVM)

Common Notation

  • $\mathbf{X}$ is a matrix containing all the feature values of all the observations
  • $n$ is the number of observations in the dataset
  • $\mathbf{x}_i$ is a vector of all the feature values (except the label) of the $i$th instance in the dataset.
  • $y_i$ is the label (desired model output) of the $i$th instance in the dataset.
  • $p$ is the number of features in the dataset
  • $\mathbf{x}_j$ is a vector of all the observations values of the $j$th feature in the dataset.

We can't always perfectly separate the data with a $p − 1$ dimensional hyperplane. To overcome this problem we could either:

  • Tweak the constraints on the hyperplane to allow some points to be misclassified (soft margin),
  • Transform the data to be separable by a hyperplane in another space (kernel method).

1.6. Support Vector Classifier (SVC) ¶

SVCs are a generalisation and extension of the maximal margin classifier, allowing it to be applied to a broader range of cases1.

In practice they are more robust to individual observations and classify most training observations better than the maximal margin classifier. This is because they take the approach that it is better to misclassify some training examples in order to do a better job classifying the rest.

This is called a soft margin, as it allows a small subset of training observations to violate the margin, ending up not only on the wrong side of the margin but even on the wrong side of the hyperplane.

Notes

  • "Developed in the computer science community in the 1990s"2

We want to relax the following constraints when necessary:

$$ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \geq 1 \text{ for } y_i = 1, \\ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \leq -1 \text{ for } y_i = -1 $$

This can be done by introducing positive slack variables $\xi_i, i = 1, \ldots, n$ in the constraints5,6,10:

$$ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \geq 1 - \xi_i \quad \text{if} \ y_i = 1, \\ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \leq -1 + \xi_i \quad \text{if} \ y_i = -1, \\ \xi_i \geq 0 \quad \forall_i. $$

Notes

  • The slack variables $\xi_1, \ldots, \xi_n$ allow individual observations to be on the wrong side of the margin or hyperplane.
  • $\sum_i\xi_i$ is an upper bound on the number of training errors
  • $\xi_i$ tells us where the $i$th observation is located relative to the hyperplane; $\xi_i = 0$ being on the correct side of the margin, $\xi_i > 0$ being on the wrong side of the margin, and $\xi_i > 1$ on the wrong side of the hyperplane.
  • Setting $\xi_1 = \ldots = \xi_n = 0$ recovers the maximal margin hyperplane optimisation.
  • Test observations are classified as before, using the sign of $f(\mathbf{x}^*) = \mathbf{w}^{\mathrm T}\mathbf{x}^* + b$.
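The slack values themselves can be recovered from a fitted model, since $\xi_i = \max(0, 1 - y_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i + b))$. A minimal sketch using scikit-learn (the blobs dataset and the parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data with some overlap (illustrative settings)
X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.8, random_state=0)
y = np.where(y == 0, -1, 1)  # relabel the classes as -1/+1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)   # w^T x_i + b for each observation
xi = np.maximum(0, 1 - y * f)  # slack: 0 on the correct side of the margin

print("margin violations (xi > 0):    ", int(np.sum(xi > 0)))
print("hyperplane violations (xi > 1):", int(np.sum(xi > 1)))
```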

Tuning Parameter (C)¶

To ensure there is a penalty, $C$, for relaxing the constraint, we can change our objective function to be minimised from $\frac{1}{2}||\mathbf{w}||^2$ to,

$$ \begin{align} {\text{minimise} \atop \mathbf{w}, b, \xi } & \quad \frac{1}{2}||\mathbf{w}||^2+C\sum\limits_{i=1}^n\xi_i \\ \text{subject to} & \quad y_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i+b) \geq 1-\xi_i, \quad \xi_i \geq 0, \quad \forall_i. \end{align} $$

$C$ is a tuning parameter that controls the bias-variance trade-off1.

The strength of the regularisation is inversely proportional to $C$: a large $C$ imposes a higher penalty on margin violations, and so regularises less.

Notes

  • Like maximal margin classifiers, SVCs rely on only a few observations: those on the margin or those that violate it (the support vectors). Observations on the correct side of the margin don't change the classifier, which means SVCs are robust to observations far away from the hyperplane.
  • When $C$ is large we have narrow margins that are rarely violated, but that are highly fit to the training data (low bias-high variance).
  • Conversely, when $C$ is smaller the classifier is less strict about misclassification errors, meaning the margin is wider (high bias-low variance).
  • Like most hyper-parameters, it is often chosen using cross-validation.
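The effect of $C$ on the margin can be seen by counting support vectors. A sketch assuming scikit-learn, with illustrative data and $C$ values:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)

wide = SVC(kernel="linear", C=0.01).fit(X, y)     # small C: wide margin
narrow = SVC(kernel="linear", C=100.0).fit(X, y)  # large C: narrow margin

# A wider margin is violated by more observations, so more of them
# end up as support vectors.
print("support vectors, C=0.01:", len(wide.support_))
print("support vectors, C=100: ", len(narrow.support_))
```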

Extra

Neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem. This means we now have6:

$\text{max} L_D \equiv \sum_i^n\alpha_i - \frac{1}{2}\sum^n_{i,k}\alpha_i\alpha_ky_iy_k\mathbf{x}_i\cdot \mathbf{x}_k \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_iy_i = 0$.

This also has the same solution as before:

$\mathbf{\hat w} = \sum\limits^{N_S}_{i=1}\alpha_iy_i\mathbf{x}_i$.

Also, sometimes $C$ is defined as $C = \frac{1}{\nu N}$, where $0 < \nu \leq 1$ controls the fraction of misclassified points during the training phase7.

1.7. Support Vector Machine (SVM) ¶

SVMs aim to address the situation where the boundary between two classes is not linear.

Feature Engineering¶

We could consider enlarging the feature space to make the dataset linearly separable.

Example: We can see below that our $\mathbf{x}_1$ is not linearly separable, but it becomes so when we add a second feature $\mathbf{x}_2 = (\mathbf{x}_1)^2$.

Using quadratic, cubic or higher-order polynomial functions we can project our data onto a higher-dimensional space via a mapping function $\phi$ where they are linearly separable (using a linear SVM model in this new feature space).

Example

$\phi(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{z}_1,\mathbf{z}_2,\mathbf{z}_3) = (\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}^2_1+\mathbf{x}^2_2)$

Figure 18: 2D Dataset into Separable 3D Feature Space
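This kind of mapping can be sketched with scikit-learn's make_circles data (a hypothetical stand-in for the figure's dataset): the classes are not linearly separable in 2D, but become so after adding $\mathbf{x}^2_1+\mathbf{x}^2_2$ as a third feature:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# phi(x1, x2) = (x1, x2, x1^2 + x2^2)
Z = np.c_[X, (X ** 2).sum(axis=1)]

# A linear SVM in the enlarged 3D feature space separates the circles
clf = SVC(kernel="linear", C=10.0).fit(Z, y)
print("training accuracy in the 3D feature space:", clf.score(Z, y))
```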

Gaussian Radial Basis Function2¶

We could instead use a "similarity function", such as a Gaussian Radial Basis Function (RBF),

$\phi_\gamma(\mathbf{x},\ell) = \exp(-\gamma||\mathbf{x}-\ell||^2)$.

This is a bell-shaped function which measures how much an instance resembles a landmark, varying from 0 (far away) to 1 (at the landmark).

Example: Below we set our landmarks to $x_1 = -2$ and $x_1 = 1$.

Using the example of $x_1=-1$ we can see it is a distance of 1 from the first landmark and 2 from the second.

If we set $\gamma = 0.3$ then our new features are:

$x_2 = \exp(-0.3 \times 1^2) \approx 0.74$

$x_3 = \exp(-0.3 \times 2^2) \approx 0.30$

To choose the landmarks, the simplest approach is just to create a landmark at the location of each instance in the dataset.
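The worked example above can be checked directly ($\gamma$ and the landmark locations are as given in the text):

```python
import numpy as np

def rbf_feature(x, landmark, gamma):
    """Gaussian RBF similarity between a 1D point and a landmark."""
    return np.exp(-gamma * (x - landmark) ** 2)

gamma = 0.3
x1 = -1.0
x2 = rbf_feature(x1, -2.0, gamma)  # landmark at -2, distance 1
x3 = rbf_feature(x1, 1.0, gamma)   # landmark at 1, distance 2
print(round(x2, 2), round(x3, 2))  # → 0.74 0.3
```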

Kernels¶

However, enlarging the feature space through feature engineering comes at a cost: the larger the number of features, the higher the computational burden.

Instead, it is common to enlarge the feature space using an extension of the SVC termed a Support Vector Machine, which uses kernels.

The kernel trick relies on the fact that we can define our SVM in terms of inner products.

$$ L_D(\alpha_i) = \sum_i^n\alpha_i - \frac{1}{2}\sum_{i,k}^n\alpha_i\alpha_ky_iy_k\mathbf{x}_i^{\mathrm T}\mathbf{x}_k \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0, \ \sum_i^n\alpha_iy_i = 0. $$

Imagine we had a mapping $\phi$ which maps the data to some high dimensional Euclidean space

$$ \phi: \mathbb{R}^d \mapsto H, $$

then, we could do dot products between vectors after the mapping in $H$:

$$ L_D(\alpha_i) = \sum_i^n\alpha_i - \frac{1}{2}\sum_{i,k}^n\alpha_i\alpha_ky_iy_k\phi(\mathbf{x}_i)^{\mathrm T}\phi(\mathbf{x}_k) $$

Instead we could use a kernel function,

$$ K(\mathbf{x}_i,\mathbf{x}_k) = \phi(\mathbf{x}_i)^{\mathrm T}\phi(\mathbf{x}_k). $$

Notes

  • Roughly speaking, a kernel can be interpreted as a similarity function between pairs of samples4.
  • $\mathbf{a} \in \mathbb{R}^k$ just means it is a vector of length $k$.
  • $\mathbf{A} \in \mathbb{R}^{r\times s}$ just means it is an $r \times s$ matrix.
  • $\mathbf{w}$ lives in $H$ ("high dimensional")
  • $H$ is often referred to as a Hilbert space.
  • "it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Scholkopf, Smola and Muller, 1998b); it seems likely that this trick will continue to find uses elsewhere." Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.

Second-Degree Polynomial Example2

Suppose we wanted to apply a second-degree polynomial transformation to two 2D vectors, $\mathbf{a}$ and $\mathbf{b}$, and then compute the dot product. We could do this by:

$$ \phi(\mathbf{a})^{\mathrm T}\phi(\mathbf{b}) = \begin{pmatrix} {a_1}^2 \\ \sqrt{2}a_1a_2 \\ {a_2}^2 \end{pmatrix}^{\mathrm T} \begin{pmatrix} {b_1}^2 \\ \sqrt{2}b_1b_2 \\ {b_2}^2\end{pmatrix} = {a_1}^2{b_1}^2+2a_1b_1a_2b_2+{a_2}^2{b_2}^2. $$

Instead we could use the kernel approach:

$$ K(\mathbf{a}, \mathbf{b}) = (\mathbf{a}^{\mathrm T}\mathbf{b})^2 = \begin{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}^{\mathrm T}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}\end{pmatrix}^2 = (a_1b_1+a_2b_2)^2 = {a_1}^2{b_1}^2+2a_1b_1a_2b_2+{a_2}^2{b_2}^2. $$

This is useful as it means we didn't have to map our data using $\phi$ first. This saves us time!
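This equivalence is easy to verify numerically; a small check with arbitrary example vectors:

```python
import numpy as np

def phi(v):
    """Explicit second-degree polynomial mapping R^2 -> R^3."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(a, b):
    """The equivalent kernel: K(a, b) = (a . b)^2."""
    return float(a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = float(phi(a) @ phi(b))  # map first, then take the dot product
implicit = poly_kernel(a, b)       # kernel: no explicit mapping needed
print(explicit, implicit)          # both equal 1.0
```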

Notes

  • A second-degree polynomial mapping for a training set with two features, $\mathbf{x} \in \mathbb{R}^2$, would be:
$$\phi(\mathbf{x}) = \phi((x_1, x_2)) = ({x_1}^2, \sqrt{2}x_1x_2, {x_2}^2)$$
Polynomial Feature Engineering (degree=3)
1.49 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Polynomial Kernel (degree=3)
1.14 ms ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We still have all the same considerations, but replacing $\mathbf{x}_i^{\mathrm T} \mathbf{x}_k$ with $K(\mathbf{x}_i, \mathbf{x}_k)$ allows us to produce an SVM in an infinite-dimensional space.

$$ \sum_i^n\alpha_i - \frac{1}{2}\sum_i^n\sum_k^n\alpha_i\alpha_ky_iy_kK(\mathbf{x}_i, \mathbf{x}_k) \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0, \ \sum_i^n\alpha_iy_i = 0. $$

In the test phase, we can use the support vectors6:

$$ \begin{align} f(\mathbf{x}^*) &= \sum_{i \in s}\hat\alpha_iy_i\phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}^*) + \hat b \\ &= \sum_{i \in s}\hat\alpha_iy_iK(\mathbf{x}_i,\mathbf{x}^*) +\hat b, \end{align} $$

avoiding computing $\phi(\mathbf{x}^*)$.

Note

  • Put another way, in the test phase6, we can use the support vectors, $\mathbf{s}_i$:
$$ \begin{align} f(\mathbf{x}^*) &= \sum_{i = 1}^{n_s}\hat\alpha_iy_i\phi(\mathbf{s}_i)\cdot \phi(\mathbf{x}^*) + \hat b \\ &= \sum_{i = 1}^{n_s}\hat\alpha_iy_iK(\mathbf{s}_i,\mathbf{x}^*) +\hat b, \end{align} $$
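With scikit-learn, this test-phase formula can be reproduced by hand, since `dual_coef_` stores the products $\hat\alpha_iy_i$ and `support_vectors_` stores the $\mathbf{s}_i$ (the data and parameter values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=2)
clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)

# f(x*) = sum_i alpha_i y_i K(s_i, x*) + b, summed over support vectors
X_new = X[:5]
K = rbf_kernel(clf.support_vectors_, X_new, gamma=0.1)
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

print(np.allclose(manual, clf.decision_function(X_new)))  # True
```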

Polynomial kernels¶

The polynomial kernel of degree $d$ (a positive integer) can be defined as:

$$K(\mathbf{x}_i,\mathbf{x}_k) = \left(\gamma \left<\mathbf{x}_i, \mathbf{x}_k\right> + r \right)^d$$

Hyperparameters¶

  • $r$ (coef0) roughly controls how much the model is influenced by high-degree polynomials2
  • $\gamma$ (gamma) scales the inner product before $r$ is added
  • $d$ (degree) sets the degree of the polynomial; higher degrees give a more flexible decision boundary

Notes

  • you will sometimes see a kernel defined in a general sense using $K(\mathbf{x},\mathbf{x}^\prime)$, where $\mathbf{x}^\prime$ means another point in the dataset
  • $\left<\mathbf{x}_i, \mathbf{x}_k\right> = \mathbf{x}_i \cdot \mathbf{x}_k = \mathbf{x}_i^{\mathrm T} \mathbf{x}_k$
  • $\gamma$ is often 1
  • $d$ is a positive integer
  • If you start to overfit, try reducing the polynomial degree; if you are underfitting, try increasing it.
  • in Scikit-Learn, $\gamma$ is specified by parameter gamma, $d$ by degree, and $r$ by coef0.
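Putting the three hyperparameters together in Scikit-Learn (the dataset and values are illustrative; scaling the features first is generally advisable for SVMs):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# gamma, degree and coef0 correspond to the kernel's gamma, d and r
poly_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=5.0),
).fit(X, y)

print("training accuracy:", poly_svm.score(X, y))
```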


Radial Basis Function kernel¶

The most widely used kernel is the RBF kernel (also known as a Gaussian kernel)4,8:

$$ K(\mathbf{x}_i,\mathbf{x}_k) = \exp\left(-\gamma||\mathbf{x}_i-\mathbf{x}_k||^2\right) $$

where $\gamma$ is a free parameter to be optimised and $||\mathbf{x}_i-\mathbf{x}_k||^2$ is the squared Euclidean distance.

$\gamma$ is often either a positive constant or $\frac{1}{2\sigma^2}$

When classifying a test observation $\mathbf{x}^* = (x^*_1, \ldots, x^*_p)$, only training observations close to $\mathbf{x}^*$ (in terms of Euclidean distance) will play a role in its class label. This is because, for training observations far from $\mathbf{x}^*$, $\sum^p_{j=1}(x^*_j-x_{ij})^2$ will be large, so $\exp(-\gamma\sum^p_{j=1}(x^*_j-x_{ij})^2)$ will be small1.
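The kernel matrix itself is straightforward to compute; a quick check of the formula against scikit-learn's rbf_kernel, using arbitrary example data:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
Xi = rng.normal(size=(4, 3))  # arbitrary example data
Xk = rng.normal(size=(5, 3))
gamma = 0.5

# K(x_i, x_k) = exp(-gamma * ||x_i - x_k||^2), computed directly
sq_dist = ((Xi[:, None, :] - Xk[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dist)

print(np.allclose(K_manual, rbf_kernel(Xi, Xk, gamma=gamma)))  # True
```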


Notes

Euclidean distance8

$$ d(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} \sqrt{\left(x_i^{(1)}-x_k^{(1)}\right)^2+\left(x_i^{(2)}-x_k^{(2)}\right)^2 + \ldots + \left(x_i^{(D)}-x_k^{(D)}\right)^2} = \sqrt{\sum_{j=1}^D\left(x_i^{(j)}-x_k^{(j)}\right)^2} $$

Hyperparameters2¶

$\gamma$ effectively acts like a regularisation hyperparameter, so treat it like $C$: if your model is overfitting, reduce it; if underfitting, increase it.

  • Increasing $\gamma$ (gamma) makes the bell-shaped curve narrower.

    • Each instance's range of influence is smaller.
    • The decision boundary becomes more irregular.
  • Decreasing $\gamma$ makes the bell-shaped curve wider.

    • Instances have a larger range of influence.
    • The decision boundary becomes smoother.

$C$ and $\gamma$ are tightly coupled.

Generally, if you have a larger $\gamma$ (a narrower bell curve, e.g. $\gamma = 5$) you'll need more regularisation, so a smaller $C$.

If you have a smaller $\gamma$ (a wider bell curve, e.g. $\gamma = 1$), a larger value of $C$ should be used7.
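Because of this coupling, the two are usually tuned jointly with cross-validation; a sketch with illustrative grid values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Search C and gamma together rather than one at a time
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 5]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```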

Notes

  • We will see this in effect in the next notebook

Associated Exercises¶

Now might be a good time to try exercises X-X.

References¶

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
  2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
  3. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
  4. Raschka, 2016
  5. Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20:273–297, 1995
  6. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
  7. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  8. Burkov, A. (2019). The hundred-page machine learning book (Vol. 1). Canada: Andriy Burkov.
  9. Zhang, J. (2015). A complete list of kernels used in support vector machines. Biochem. Pharmacol.(Los Angel), 4, 2167-0501.
  10. Raschka, S. (2015). Python Machine Learning. Packt Publishing.

web1. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
web2. https://scikit-learn.org/stable/datasets/toy_dataset.html